Measurements of Spoken Language Variability in a Multilingual Corpus. Predictable Aspects

Author

  • Massimo Moneglia
Abstract

The paper provides cross-linguistic measurements of everyday language use based on the C-ORAL-ROM multilingual corpus of spontaneous speech. The average and the coefficient of variation of a series of standard parameters are reported and set against the main sociological and structural contexts of spoken language use. The parameters are: Mid-Length of Utterances (MLU); Mid-Length of the dialogic turn (MLTw); Speed; Mid-Length of the tone unit (MLTone); and Fragmentation. These parameters vary in strongly predictable ways at the cross-linguistic level. MLU correlates positively with MLTw and shows highly predictable values in informal dialogic structures. Both MLU and MLTw correlate inversely with Speed. MLTone and Speed are predictable from language-specific features, but while MLTone has low intra-linguistic variation, Speed shows a cross-linguistic tendency toward lower values in formal language use. Fragmentation is a permanent feature of spoken language, but it varies mainly from speaker to speaker.


Similar resources

Multilingual Spoken Language Corpus Development for Communication Research

Multilingual spoken language corpora are indispensable for research on areas of spoken language communication, such as speech-to-speech translation. The speech and natural language processing essential to multilingual spoken language research requires unified structure and annotation, such as tagging. In this study, we describe an experience with multilingual spoken language corpus development ...


Vague Language and Interpersonal Communication: An Analysis of Adolescent Intercultural Conversation

This paper is concerned with the analysis of the spoken language of teenagers, taken from a newly developed specialised corpus, the British and Taiwanese Teenage Intercultural Communication Corpus (BATTICC). More specifically, the study employs a discourse analytical approach to examine vague language in an intercultural context among a group of British and Taiwanese adolescents, paying particul...


The Development of the Multilingual LUNA Corpus for Spoken Language System Porting

The development of annotated corpora is a critical process in the development of speech applications for multiple target languages. While the technology to develop a monolingual speech application has reached satisfactory results (in terms of performance and effort), porting an existing application from a source language to a target language is still a very expensive task. In this paper we addr...


Multilingual Aspects of Monolingual Corpora

If someone were to collect opinions among computational linguists on what had been the most important trend in linguistics in the last decade, it is highly probable that the majority would answer that it was the massive use of large natural language corpora in many linguistic fields. The concept of collecting large amounts of written or spoken natural language data has become extremely important...


Multilingual corpora for speech-to-speech translation research

Multilingual spoken language corpora are indispensable for developing new speech-to-speech machine translation (S2SMT) technologies. This paper first discusses characteristics that corpora for S2SMT should have, then surveys existing corpora. Finally, it compares these corpora.




Publication date: 2004